Demystifying protein annotations: toward increasing the compatibility of different corpora

Authors

  • Yue Wang
  • Jin-Dong Kim
  • Rune Sætre
  • Sampo Pyysalo
  • Tomoko Ohta
  • Jun’ichi Tsujii
Abstract

While there are a number of corpora with protein annotations, the annotations in different corpora are not compatible with each other. It is, however, not yet well understood how they are different and how the incompatibilities can be overcome. The situation discourages utilization of the corpora in a united way. It also indicates that even within individual corpora, the actual annotations are not well understood. We first compare the protein annotations of two corpora, GENIA and GENETAG. Based on the result, we propose several strategies to increase the cross-corpus compatibility. Experimental results show that the proposed strategies are effective and the incompatibility of the protein annotations between the two corpora can be removed if we properly consider their differences.

Similar resources

Raising the Compatibility of Heterogeneous Annotations: A Case Study on

While there are several corpora which claim to have annotations for protein references, the heterogeneity between the annotations is recognized as an obstacle to develop expensive resources in a synergistic way. Here we present a series of experimental results which show the differences of protein mention annotations made to two corpora, GENIA and AImed.

The Effects of Multimedia Annotations on Iranian EFL Learners’ L2 Vocabulary Learning

In our modern technological world, Computer-Assisted Language learning (CALL) is a new realm towards learning a language in general, and learning L2 vocabulary in particular. It is assumed that the use of multimedia annotations promotes language learners’ vocabulary acquisition. Therefore, this study set out to investigate the effects of different multimedia annotations (still picture annotatio...

Multiple Annotations of Reusable Data Resources: Corpora for Topic Detection and Tracking

Responding to demands for very large, easily accessible, reusable news corpora to support research in the topic detection and tracking paradigm, the Linguistic Data Consortium created the TDT corpora. In addition to supporting research in the Topic Detection and Tracking program, the TDT corpora were collected and annotated with an eye toward reuse and re-annotation. Their value is confirmed in...

Interoperability of Corpora and Annotations

This paper describes the application of OWL and RDF to address the interoperability of linguistic corpora and linguistic annotations within such corpora. Interoperability of linguistic corpora involves two aspects: Structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability (annotations of different origin are linked to ...

Linked annotations: a middle ground for manual curation of biomedical databases and text corpora

Annotators of text corpora and biomedical databases carry out the same labor-intensive task to manually extract structured data from unstructured text. Tasks are needlessly repeated because text corpora are widely scattered. We envision that a linked annotation resource unifying many corpora could be a game changer. Such an open forum will help focus on novel annotations and on optimally benefi...

Publication date: 2009